U.S. Pollution from 2000 - 2016

Group Members: Matthew Dang, Vincent Cheng, Brian Chung

Introduction

For this tutorial, we decided to analyze US pollution data from 2000 to 2016. The data was gathered by the US EPA from various monitoring sites across America. It focuses on four different pollutants: nitrogen dioxide (NO2), sulfur dioxide (SO2), carbon monoxide (CO), and ozone (O3). As climate change has become a recurring issue in the international community, we wanted to see how America's efforts have reduced pollution. We hope to find trends in the data that highlight recent progress in reducing pollution. Note that the dataset only contains pollution values measured at EPA monitoring locations.

The main focuses of this tutorial are:

  1. data curation
  2. parsing and management
  3. exploratory data analysis
  4. hypothesis testing and machine learning

Data Collection and Cleaning (Data Curation and Management)

The first part of the process is data curation and parsing. We need to import our dataset, which can be found here: https://www.kaggle.com/sogun3/uspollution/data. Raw data is rarely clean and needs to be modified because of missing values or extraneous information. This could mean removing information that is not relevant to our analysis, dropping NaN values, or restructuring our dataframe to help our analysis later on. This gives us a clearer understanding of what the data shows and an easier time parsing, analyzing, and later visualizing it. We followed the principles of tidy data; if you want more information about tidy data, here is a link (https://www.jstatsoft.org/article/view/v059i10/v59i10.pdf).

Importing Libraries

First we will import some libraries that we will need for our analysis.

In [1]:
!pip install ggplot 
!pip install plotly
import pandas as pd                # dataframe storage and manipulation
import numpy as np                 # numerical helpers
import matplotlib.pyplot as plt    # plotting
import matplotlib
matplotlib.style.use('ggplot')     # use the ggplot visual style
from ggplot import *               # grammar-of-graphics plots (violin plots below)
import plotly.plotly as py         # choropleth maps below

Import Dataset

In [2]:
data_raw = pd.read_csv("data/pollution_us_2000_2016.csv")
data_raw.head()
Out[2]:
Unnamed: 0 State Code County Code Site Num Address State County City Date Local NO2 Units ... SO2 Units SO2 Mean SO2 1st Max Value SO2 1st Max Hour SO2 AQI CO Units CO Mean CO 1st Max Value CO 1st Max Hour CO AQI
0 0 4 13 3002 1645 E ROOSEVELT ST-CENTRAL PHOENIX STN Arizona Maricopa Phoenix 2000-01-01 Parts per billion ... Parts per billion 3.000000 9.0 21 13.0 Parts per million 1.145833 4.2 21 NaN
1 1 4 13 3002 1645 E ROOSEVELT ST-CENTRAL PHOENIX STN Arizona Maricopa Phoenix 2000-01-01 Parts per billion ... Parts per billion 3.000000 9.0 21 13.0 Parts per million 0.878947 2.2 23 25.0
2 2 4 13 3002 1645 E ROOSEVELT ST-CENTRAL PHOENIX STN Arizona Maricopa Phoenix 2000-01-01 Parts per billion ... Parts per billion 2.975000 6.6 23 NaN Parts per million 1.145833 4.2 21 NaN
3 3 4 13 3002 1645 E ROOSEVELT ST-CENTRAL PHOENIX STN Arizona Maricopa Phoenix 2000-01-01 Parts per billion ... Parts per billion 2.975000 6.6 23 NaN Parts per million 0.878947 2.2 23 25.0
4 4 4 13 3002 1645 E ROOSEVELT ST-CENTRAL PHOENIX STN Arizona Maricopa Phoenix 2000-01-02 Parts per billion ... Parts per billion 1.958333 3.0 22 4.0 Parts per million 0.850000 1.6 23 NaN

5 rows × 29 columns

Tidying Data

The data is imported and stored using pandas, a common module for storing and manipulating data. Pandas makes it easy to modify and reshape the data before it is plotted and analyzed for possible trends. For more information on pandas and how to use it, here is a link (http://www.gregreda.com/2013/10/26/working-with-pandas-dataframes/). Here is also a short tutorial about the basic functions of pandas (https://www.dataquest.io/blog/pandas-python-tutorial/).
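
Before dropping rows wholesale, it is worth checking how much data `dropna` will discard. A minimal sketch on a toy frame (the column names mirror the dataset, but the values here are invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy stand-in for data_raw: same column names, invented values.
df = pd.DataFrame({
    "State": ["Arizona", "Arizona", "Wyoming"],
    "SO2 AQI": [13.0, np.nan, 4.0],
    "CO AQI": [np.nan, 25.0, 1.0],
})

# Count NaNs per column, then see how many rows a blanket dropna keeps.
nan_counts = df.isna().sum()
clean = df.dropna()
print(nan_counts)
print(len(clean))
```

If one column turns out to be mostly NaN, dropping that column instead of the rows can preserve far more of the data.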

In [3]:
# remove any rows containing NaN values
data_raw = data_raw.dropna()
data_raw.head()
Out[3]:
Unnamed: 0 State Code County Code Site Num Address State County City Date Local NO2 Units ... SO2 Units SO2 Mean SO2 1st Max Value SO2 1st Max Hour SO2 AQI CO Units CO Mean CO 1st Max Value CO 1st Max Hour CO AQI
1 1 4 13 3002 1645 E ROOSEVELT ST-CENTRAL PHOENIX STN Arizona Maricopa Phoenix 2000-01-01 Parts per billion ... Parts per billion 3.000000 9.0 21 13.0 Parts per million 0.878947 2.2 23 25.0
5 5 4 13 3002 1645 E ROOSEVELT ST-CENTRAL PHOENIX STN Arizona Maricopa Phoenix 2000-01-02 Parts per billion ... Parts per billion 1.958333 3.0 22 4.0 Parts per million 1.066667 2.3 0 26.0
9 9 4 13 3002 1645 E ROOSEVELT ST-CENTRAL PHOENIX STN Arizona Maricopa Phoenix 2000-01-03 Parts per billion ... Parts per billion 5.250000 11.0 19 16.0 Parts per million 1.762500 2.5 8 28.0
13 13 4 13 3002 1645 E ROOSEVELT ST-CENTRAL PHOENIX STN Arizona Maricopa Phoenix 2000-01-04 Parts per billion ... Parts per billion 7.083333 16.0 8 23.0 Parts per million 1.829167 3.0 23 34.0
17 17 4 13 3002 1645 E ROOSEVELT ST-CENTRAL PHOENIX STN Arizona Maricopa Phoenix 2000-01-05 Parts per billion ... Parts per billion 8.708333 15.0 7 21.0 Parts per million 2.700000 3.7 2 42.0

5 rows × 29 columns

In [4]:
# view our variables
list(data_raw.columns)
Out[4]:
['Unnamed: 0',
 'State Code',
 'County Code',
 'Site Num',
 'Address',
 'State',
 'County',
 'City',
 'Date Local',
 'NO2 Units',
 'NO2 Mean',
 'NO2 1st Max Value',
 'NO2 1st Max Hour',
 'NO2 AQI',
 'O3 Units',
 'O3 Mean',
 'O3 1st Max Value',
 'O3 1st Max Hour',
 'O3 AQI',
 'SO2 Units',
 'SO2 Mean',
 'SO2 1st Max Value',
 'SO2 1st Max Hour',
 'SO2 AQI',
 'CO Units',
 'CO Mean',
 'CO 1st Max Value',
 'CO 1st Max Hour',
 'CO AQI']

There are some unnecessary variables here, so we can remove them.

In [5]:
drop = ['Unnamed: 0',
 'State Code',
 'County Code',
 'Site Num',
 'Address',
 'County',
 'City',
 'NO2 Units',
 'NO2 1st Max Value',
 'NO2 1st Max Hour',
 'O3 Units',
 'O3 1st Max Value',
 'O3 1st Max Hour',
 'SO2 Units',
 'SO2 1st Max Value',
 'SO2 1st Max Hour',
 'CO Units',
 'CO 1st Max Value',
 'CO 1st Max Hour']
data = data_raw.drop(drop, axis = 1)
data.head()
Out[5]:
State Date Local NO2 Mean NO2 AQI O3 Mean O3 AQI SO2 Mean SO2 AQI CO Mean CO AQI
1 Arizona 2000-01-01 19.041667 46 0.022500 34 3.000000 13.0 0.878947 25.0
5 Arizona 2000-01-02 22.958333 34 0.013375 27 1.958333 4.0 1.066667 26.0
9 Arizona 2000-01-03 38.125000 48 0.007958 14 5.250000 16.0 1.762500 28.0
13 Arizona 2000-01-04 40.260870 72 0.014167 28 7.083333 23.0 1.829167 34.0
17 Arizona 2000-01-05 48.450000 58 0.006667 10 8.708333 21.0 2.700000 42.0

We can also change the states to their abbreviations. This will make our graphs more readable later on.

In [6]:
us_state_abbrev = {
    'Alabama': 'AL',
    'Alaska': 'AK',
    'Arizona': 'AZ',
    'Arkansas': 'AR',
    'California': 'CA',
    'Colorado': 'CO',
    'Connecticut': 'CT',
    'Delaware': 'DE',
    'Florida': 'FL',
    'Georgia': 'GA',
    'Hawaii': 'HI',
    'Idaho': 'ID',
    'Illinois': 'IL',
    'Indiana': 'IN',
    'Iowa': 'IA',
    'Kansas': 'KS',
    'Kentucky': 'KY',
    'Louisiana': 'LA',
    'Maine': 'ME',
    'Maryland': 'MD',
    'Massachusetts': 'MA',
    'Michigan': 'MI',
    'Minnesota': 'MN',
    'Mississippi': 'MS',
    'Missouri': 'MO',
    'Montana': 'MT',
    'Nebraska': 'NE',
    'Nevada': 'NV',
    'New Hampshire': 'NH',
    'New Jersey': 'NJ',
    'New Mexico': 'NM',
    'New York': 'NY',
    'North Carolina': 'NC',
    'North Dakota': 'ND',
    'Ohio': 'OH',
    'Oklahoma': 'OK',
    'Oregon': 'OR',
    'Pennsylvania': 'PA',
    'Rhode Island': 'RI',
    'South Carolina': 'SC',
    'South Dakota': 'SD',
    'Tennessee': 'TN',
    'Texas': 'TX',
    'Utah': 'UT',
    'Vermont': 'VT',
    'Virginia': 'VA',
    'Washington': 'WA',
    'West Virginia': 'WV',
    'Wisconsin': 'WI',
    'Wyoming': 'WY',
}

data['state'] = data['State'].map(us_state_abbrev)
data['NO2Mean'] = data['NO2 Mean']
data['NO2AQI'] = data['NO2 AQI']
data['O3Mean'] = data['O3 Mean'] * 1000   # parts per million -> parts per billion
data['O3AQI'] = data['O3 AQI']
data['SO2Mean'] = data['SO2 Mean']
data['SO2AQI'] = data['SO2 AQI']
data['COMean'] = data['CO Mean'] * 1000   # parts per million -> parts per billion
data['COAQI'] = data['CO AQI']
data = data.drop('State', axis=1)
data
Out[6]:
Date Local NO2 Mean NO2 AQI O3 Mean O3 AQI SO2 Mean SO2 AQI CO Mean CO AQI state NO2Mean NO2AQI O3Mean O3AQI SO2Mean SO2AQI COMean COAQI
1 2000-01-01 19.041667 46 0.022500 34 3.000000 13.0 0.878947 25.0 AZ 19.041667 46 22.500 34 3.000000 13.0 878.947 25.0
5 2000-01-02 22.958333 34 0.013375 27 1.958333 4.0 1.066667 26.0 AZ 22.958333 34 13.375 27 1.958333 4.0 1066.667 26.0
9 2000-01-03 38.125000 48 0.007958 14 5.250000 16.0 1.762500 28.0 AZ 38.125000 48 7.958 14 5.250000 16.0 1762.500 28.0
13 2000-01-04 40.260870 72 0.014167 28 7.083333 23.0 1.829167 34.0 AZ 40.260870 72 14.167 28 7.083333 23.0 1829.167 34.0
17 2000-01-05 48.450000 58 0.006667 10 8.708333 21.0 2.700000 42.0 AZ 48.450000 58 6.667 10 8.708333 21.0 2700.000 42.0
21 2000-01-06 39.950000 71 0.011750 21 6.761905 24.0 2.308333 41.0 AZ 39.950000 71 11.750 21 6.761905 24.0 2308.333 41.0
25 2000-01-07 29.625000 41 0.011625 20 8.666667 30.0 1.829167 40.0 AZ 29.625000 41 11.625 20 8.666667 30.0 1829.167 40.0
29 2000-01-08 29.666667 39 0.009750 17 8.250000 26.0 2.787500 57.0 AZ 29.666667 39 9.750 17 8.250000 26.0 2787.500 57.0
33 2000-01-09 25.083333 35 0.010792 19 6.500000 19.0 1.675000 32.0 AZ 25.083333 35 10.792 19 6.500000 19.0 1675.000 32.0
37 2000-01-10 37.666667 68 0.008458 13 9.958333 30.0 2.179167 42.0 AZ 37.666667 68 8.458 13 9.958333 30.0 2179.167 42.0
41 2000-01-11 50.500000 80 0.008417 14 11.625000 34.0 2.533333 51.0 AZ 50.500000 80 8.417 14 11.625000 34.0 2533.333 51.0
45 2000-01-12 49.125000 80 0.008208 12 10.916667 37.0 2.316667 48.0 AZ 49.125000 80 8.208 12 10.916667 37.0 2316.667 48.0
49 2000-01-13 73.285714 104 0.006167 8 10.952381 30.0 2.958333 52.0 AZ 73.285714 104 6.167 8 10.952381 30.0 2958.333 52.0
53 2000-01-14 66.541667 105 0.008708 15 11.625000 41.0 3.575000 59.0 AZ 66.541667 105 8.708 15 11.625000 41.0 3575.000 59.0
57 2000-01-15 53.166667 86 0.010625 19 9.583333 31.0 2.175000 51.0 AZ 53.166667 86 10.625 19 9.583333 31.0 2175.000 51.0
61 2000-01-16 45.750000 71 0.010750 19 6.458333 19.0 1.962500 54.0 AZ 45.750000 71 10.750 19 6.458333 19.0 1962.500 54.0
65 2000-01-17 59.250000 101 0.008375 13 8.500000 27.0 1.987500 44.0 AZ 59.250000 101 8.375 13 8.500000 27.0 1987.500 44.0
69 2000-01-18 66.791667 101 0.006333 8 12.166667 33.0 2.891667 55.0 AZ 66.791667 101 6.333 8 12.166667 33.0 2891.667 55.0
73 2000-01-19 59.041667 86 0.006958 11 10.166667 31.0 2.650000 55.0 AZ 59.041667 86 6.958 11 10.166667 31.0 2650.000 55.0
77 2000-01-20 48.357143 63 0.008500 14 9.391304 24.0 2.366667 47.0 AZ 48.357143 63 8.500 14 9.391304 24.0 2366.667 47.0
81 2000-01-21 54.500000 58 0.008542 12 7.958333 24.0 1.912500 40.0 AZ 54.500000 58 8.542 12 7.958333 24.0 1912.500 40.0
85 2000-01-22 38.083333 62 0.010583 17 6.875000 20.0 1.912500 43.0 AZ 38.083333 62 10.583 17 6.875000 20.0 1912.500 43.0
89 2000-01-23 37.958333 64 0.016292 31 6.791667 20.0 1.979167 45.0 AZ 37.958333 64 16.292 31 6.791667 20.0 1979.167 45.0
93 2000-01-24 53.333333 76 0.011417 19 9.083333 26.0 1.979167 38.0 AZ 53.333333 76 11.417 19 9.083333 26.0 1979.167 38.0
97 2000-01-25 42.583333 56 0.009917 17 7.833333 16.0 1.554167 33.0 AZ 42.583333 56 9.917 17 7.833333 16.0 1554.167 33.0
101 2000-01-26 27.217391 40 0.017750 30 3.363636 10.0 0.866667 16.0 AZ 27.217391 40 17.750 30 3.363636 10.0 866.667 16.0
105 2000-01-27 33.375000 48 0.014000 24 3.684211 16.0 0.925000 18.0 AZ 33.375000 48 14.000 24 3.684211 16.0 925.000 18.0
109 2000-01-28 36.875000 51 0.013000 23 3.952381 20.0 1.454167 22.0 AZ 36.875000 51 13.000 23 3.952381 20.0 1454.167 22.0
113 2000-01-29 39.625000 64 0.017083 31 6.157895 24.0 1.708333 35.0 AZ 39.625000 64 17.083 31 6.157895 24.0 1708.333 35.0
117 2000-01-30 35.208333 54 0.016750 30 2.450000 14.0 1.491667 43.0 AZ 35.208333 54 16.750 30 2.450000 14.0 1491.667 43.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1746542 2016-03-01 1.195652 3 0.041750 43 -0.104545 0.0 0.100000 1.0 WY 1.195652 3 41.750 43 -0.104545 0.0 100.000 1.0
1746546 2016-03-02 0.870833 1 0.043375 44 -0.120833 0.0 0.100000 1.0 WY 0.870833 1 43.375 44 -0.120833 0.0 100.000 1.0
1746550 2016-03-03 3.517391 10 0.043958 48 -0.050000 0.0 0.141667 2.0 WY 3.517391 10 43.958 48 -0.050000 0.0 141.667 2.0
1746554 2016-03-04 3.429167 16 0.034833 41 -0.083333 0.0 0.100000 1.0 WY 3.429167 16 34.833 41 -0.083333 0.0 100.000 1.0
1746558 2016-03-05 12.630435 33 0.034625 41 0.368182 1.0 0.100000 1.0 WY 12.630435 33 34.625 41 0.368182 1.0 100.000 1.0
1746562 2016-03-06 1.826087 3 0.040750 45 0.030435 0.0 0.095833 1.0 WY 1.826087 3 40.750 45 0.030435 0.0 95.833 1.0
1746566 2016-03-07 3.023529 11 0.032933 36 0.075000 0.0 0.075000 1.0 WY 3.023529 11 32.933 36 0.075000 0.0 75.000 1.0
1746570 2016-03-08 2.089474 8 0.035063 40 0.077778 0.0 0.081250 1.0 WY 2.089474 8 35.063 40 0.077778 0.0 81.250 1.0
1746574 2016-03-09 6.094118 18 0.034857 42 0.050000 0.0 0.100000 1.0 WY 6.094118 18 34.857 42 0.050000 0.0 100.000 1.0
1746578 2016-03-11 5.465217 19 0.036750 45 0.077273 0.0 0.100000 1.0 WY 5.465217 19 36.750 45 0.077273 0.0 100.000 1.0
1746582 2016-03-12 3.975000 10 0.038375 43 0.383333 4.0 0.095833 1.0 WY 3.975000 10 38.375 43 0.383333 4.0 95.833 1.0
1746586 2016-03-13 1.295455 3 0.040458 43 -0.019048 0.0 0.100000 1.0 WY 1.295455 3 40.458 43 -0.019048 0.0 100.000 1.0
1746590 2016-03-14 5.108333 17 0.042375 45 0.025000 0.0 0.100000 1.0 WY 5.108333 17 42.375 45 0.025000 0.0 100.000 1.0
1746594 2016-03-15 0.773913 1 0.042167 41 -0.031818 0.0 0.100000 1.0 WY 0.773913 1 42.167 41 -0.031818 0.0 100.000 1.0
1746598 2016-03-16 1.404167 2 0.043833 44 -0.087500 0.0 0.100000 1.0 WY 1.404167 2 43.833 44 -0.087500 0.0 100.000 1.0
1746602 2016-03-17 1.234783 3 0.043542 44 -0.027273 0.0 0.083333 1.0 WY 1.234783 3 43.542 44 -0.027273 0.0 83.333 1.0
1746606 2016-03-18 1.262500 2 0.037375 39 0.012500 0.0 0.100000 1.0 WY 1.262500 2 37.375 39 0.012500 0.0 100.000 1.0
1746610 2016-03-19 1.530435 4 0.039500 40 -0.163636 0.0 0.100000 1.0 WY 1.530435 4 39.500 40 -0.163636 0.0 100.000 1.0
1746614 2016-03-20 7.478261 23 0.040708 44 0.086957 0.0 0.100000 1.0 WY 7.478261 23 40.708 44 0.086957 0.0 100.000 1.0
1746618 2016-03-21 2.356522 18 0.046000 49 -0.036364 0.0 0.085714 1.0 WY 2.356522 18 46.000 49 -0.036364 0.0 85.714 1.0
1746622 2016-03-22 4.825000 25 0.037667 47 0.091304 1.0 0.100000 1.0 WY 4.825000 25 37.667 47 0.091304 1.0 100.000 1.0
1746626 2016-03-23 1.273913 2 0.035125 39 0.000000 0.0 0.100000 1.0 WY 1.273913 2 35.125 39 0.000000 0.0 100.000 1.0
1746630 2016-03-24 2.212500 8 0.044792 45 -0.083333 0.0 0.100000 1.0 WY 2.212500 8 44.792 45 -0.083333 0.0 100.000 1.0
1746634 2016-03-25 1.626087 9 0.041708 45 -0.031818 0.0 0.100000 1.0 WY 1.626087 9 41.708 45 -0.031818 0.0 100.000 1.0
1746638 2016-03-26 3.758333 25 0.033292 38 -0.050000 0.0 0.100000 1.0 WY 3.758333 25 33.292 38 -0.050000 0.0 100.000 1.0
1746642 2016-03-27 4.277273 22 0.041958 46 -0.095238 0.0 0.100000 1.0 WY 4.277273 22 41.958 46 -0.095238 0.0 100.000 1.0
1746646 2016-03-28 8.317391 21 0.041292 48 0.117391 0.0 0.100000 1.0 WY 8.317391 21 41.292 48 0.117391 0.0 100.000 1.0
1746650 2016-03-29 2.564706 3 0.028000 37 0.143750 0.0 0.006667 1.0 WY 2.564706 3 28.000 37 0.143750 0.0 6.667 1.0
1746654 2016-03-30 1.083333 1 0.043917 44 0.016667 0.0 0.091667 1.0 WY 1.083333 1 43.917 44 0.016667 0.0 91.667 1.0
1746658 2016-03-31 0.939130 1 0.045263 44 -0.022727 0.0 0.100000 1.0 WY 0.939130 1 45.263 44 -0.022727 0.0 100.000 1.0

436876 rows × 18 columns

For our analysis we are interested in changes by year, so we will parse just the year from each date.

In [7]:
data['year'] = data['Date Local'].apply(lambda x: x[0:4])
data = data.drop('Date Local', axis=1)

data_avg = data.groupby(['state', 'year'], as_index=False).mean()
data_avg.head()
data.groupby(['state'], as_index=False).mean()
Out[7]:
state NO2 Mean NO2 AQI O3 Mean O3 AQI SO2 Mean SO2 AQI CO Mean CO AQI NO2Mean NO2AQI O3Mean O3AQI SO2Mean SO2AQI COMean COAQI
0 AK 11.332871 19.580972 0.012790 17.712551 6.101601 14.506073 0.424387 6.528340 11.332871 19.580972 12.789704 17.712551 6.101601 14.506073 424.386524 6.528340
1 AL 9.410896 21.228900 0.024601 36.831202 1.051202 7.005115 0.212973 3.851662 9.410896 21.228900 24.600926 36.831202 1.051202 7.005115 212.972913 3.851662
2 AR 9.754085 21.487772 0.026168 35.035213 1.397471 2.975883 0.423522 5.929914 9.754085 21.487772 26.167844 35.035213 1.397471 2.975883 423.522225 5.929914
3 AZ 19.066303 36.106818 0.024991 39.004980 1.373707 4.212319 0.492256 9.190681 19.066303 36.106818 24.991157 39.004980 1.373707 4.212319 492.256226 9.190681
4 CA 13.654535 24.114337 0.026053 35.721988 1.155982 3.598922 0.449641 7.405755 13.654535 24.114337 26.052534 35.721988 1.155982 3.598922 449.640515 7.405755
5 CO 19.635766 35.961709 0.023550 34.675037 1.508337 10.606522 0.445569 7.724804 19.635766 35.961709 23.550483 34.675037 1.508337 10.606522 445.569214 7.724804
6 CT 8.991660 18.458773 0.028919 37.149806 0.926693 3.222504 0.251569 3.586396 8.991660 18.458773 28.918966 37.149806 0.926693 3.222504 251.568905 3.586396
7 DE 11.591813 21.557756 0.026506 35.410341 1.033663 2.828383 0.261954 3.839384 11.591813 21.557756 26.506400 35.410341 1.033663 2.828383 261.953818 3.839384
8 FL 7.361649 16.382003 0.026800 36.000154 0.500828 2.876370 0.426557 5.921593 7.361649 16.382003 26.799831 36.000154 0.500828 2.876370 426.556812 5.921593
9 GA 11.227470 24.146632 0.020588 34.485492 0.494373 1.793264 0.321156 5.273057 11.227470 24.146632 20.587830 34.485492 0.494373 1.793264 321.155567 5.273057
10 HI 3.162423 8.653740 0.025147 26.663189 0.999316 2.284646 0.367278 4.481693 3.162423 8.653740 25.146816 26.663189 0.999316 2.284646 367.278108 4.481693
11 IA 6.788070 14.291460 0.027864 33.837098 0.413967 1.396194 0.218407 3.043162 6.788070 14.291460 27.864315 33.837098 0.413967 1.396194 218.406636 3.043162
12 ID 8.932213 23.336245 0.027001 34.956332 0.337252 0.823144 0.180127 2.875546 8.932213 23.336245 27.000908 34.956332 0.337252 0.823144 180.126825 2.875546
13 IL 15.566840 28.378829 0.022657 31.986758 2.700515 13.132339 0.394352 6.258057 15.566840 28.378829 22.656696 31.986758 2.700515 13.132339 394.351754 6.258057
14 IN 12.117072 24.055157 0.029419 41.589486 3.099869 14.666188 0.358315 5.395001 12.117072 24.055157 29.419050 41.589486 3.099869 14.666188 358.315279 5.395001
15 KS 11.088594 21.578325 0.026002 34.362978 2.381229 9.391564 0.410451 6.541608 11.088594 21.578325 26.001945 34.362978 2.381229 9.391564 410.450642 6.541608
16 KY 11.983817 25.142818 0.029096 43.018498 3.759421 18.559576 0.165087 3.178455 11.983817 25.142818 29.095876 43.018498 3.759421 18.559576 165.086981 3.178455
17 LA 13.756172 26.738985 0.023353 34.019099 2.354302 11.316803 0.413524 6.311275 13.756172 26.738985 23.352762 34.019099 2.354302 11.316803 413.523614 6.311275
18 MA 18.646895 29.778458 0.020667 25.912495 2.576945 8.006489 0.319554 5.540044 18.646895 29.778458 20.666960 25.912495 2.576945 8.006489 319.554088 5.540044
19 MD 9.976931 19.781080 0.026587 36.865999 1.677965 7.063010 0.282743 4.369735 9.976931 19.781080 26.586813 36.865999 1.677965 7.063010 282.743387 4.369735
20 ME 5.137134 11.286996 0.025041 28.069082 0.947963 2.988317 0.226094 3.087199 5.137134 11.286996 25.040655 28.069082 0.947963 2.988317 226.093653 3.087199
21 MI 16.818226 31.851490 0.027338 40.022960 3.132039 17.968735 0.350807 6.469956 16.818226 31.851490 27.337893 40.022960 3.132039 17.968735 350.806573 6.469956
22 MN 6.835440 15.875561 0.026922 33.961883 0.688396 1.156951 0.245750 3.450673 6.835440 15.875561 26.922170 33.961883 0.688396 1.156951 245.750453 3.450673
23 MO 14.958263 29.166363 0.027788 41.906610 3.461487 18.316555 0.469007 7.413180 14.958263 29.166363 27.787820 41.906610 3.461487 18.316555 469.007202 7.413180
24 NC 10.553807 22.404223 0.029191 44.370394 1.478311 8.604396 0.355666 5.628313 10.553807 22.404223 29.190881 44.370394 1.478311 8.604396 355.666226 5.628313
25 ND 4.793360 12.426134 0.025893 29.978947 0.229954 0.412341 0.169603 2.053358 4.793360 12.426134 25.893216 29.978947 0.229954 0.412341 169.603361 2.053358
26 NH 7.397317 14.662075 0.027946 34.541972 1.424841 7.638399 0.345901 5.073181 7.397317 14.662075 27.946191 34.541972 1.424841 7.638399 345.901093 5.073181
27 NJ 18.596304 31.677091 0.023566 34.313332 3.130810 11.129134 0.403735 6.437528 18.596304 31.677091 23.566121 34.313332 3.130810 11.129134 403.735256 6.437528
28 NM 12.334884 24.465286 0.031703 41.327548 0.627760 1.179171 0.211173 3.479843 12.334884 24.465286 31.703235 41.327548 0.627760 1.179171 211.172541 3.479843
29 NV 12.046286 23.946760 0.032033 40.502270 0.352389 1.110607 0.221799 3.658688 12.046286 23.946760 32.033489 40.502270 0.352389 1.110607 221.798889 3.658688
30 NY 18.489442 30.581480 0.023638 31.782739 4.679582 14.065706 0.358441 5.657910 18.489442 30.581480 23.637936 31.782739 4.679582 14.065706 358.441476 5.657910
31 OH 12.113578 22.336240 0.024691 31.715208 2.735786 15.372340 0.271291 4.010115 12.113578 22.336240 24.691149 31.715208 2.735786 15.372340 271.291323 4.010115
32 OK 6.758890 14.892629 0.031374 41.029135 0.730165 3.744515 0.140652 2.486941 6.758890 14.892629 31.373969 41.029135 0.730165 3.744515 140.652125 2.486941
33 OR 9.649336 17.244151 0.019264 25.169888 0.990416 2.035266 0.305899 5.323160 9.649336 17.244151 19.264026 25.169888 0.990416 2.035266 305.899333 5.323160
34 PA 12.416034 23.812809 0.026426 39.714815 4.057368 15.803608 0.242903 3.986082 12.416034 23.812809 26.426383 39.714815 4.057368 15.803608 242.903156 3.986082
35 RI 7.147757 14.375949 0.030072 37.632911 0.530564 1.668987 0.225506 3.162025 7.147757 14.375949 30.072344 37.632911 0.530564 1.668987 225.505734 3.162025
36 SC 1.939410 4.936353 0.031676 37.256426 0.834706 4.758874 0.125366 1.604651 1.939410 4.936353 31.675790 37.256426 0.834706 4.758874 125.365998 1.604651
37 SD 5.077920 11.074110 0.030023 34.358999 0.486860 0.669394 0.184031 2.385948 5.077920 11.074110 30.022847 34.358999 0.486860 0.669394 184.030736 2.385948
38 TN 1.697732 3.554489 0.037796 45.365319 0.830904 2.329678 0.439324 4.878684 1.697732 3.554489 37.795529 45.365319 0.830904 2.329678 439.323500 4.878684
39 TX 11.591420 23.564385 0.025765 35.801798 0.996449 4.674100 0.248725 4.450385 11.591420 23.564385 25.764951 35.801798 0.996449 4.674100 248.725057 4.450385
40 UT 13.184446 23.486848 0.032074 42.293955 0.414634 1.151823 0.355123 5.413475 13.184446 23.486848 32.073565 42.293955 0.414634 1.151823 355.122986 5.413475
41 VA 10.535442 20.266608 0.027899 38.054464 3.000386 9.857911 0.374470 5.359833 10.535442 20.266608 27.899466 38.054464 3.000386 9.857911 374.470200 5.359833
42 WA 10.231211 21.717842 0.020866 27.705394 0.779516 1.946058 0.206082 2.730290 10.231211 21.717842 20.866365 27.705394 0.779516 1.946058 206.081631 2.730290
43 WI 14.968732 26.166227 0.022432 27.321900 2.619945 8.279683 0.350330 5.248021 14.968732 26.166227 22.432179 27.321900 2.619945 8.279683 350.330069 5.248021
44 WY 3.292677 9.795768 0.038065 41.444649 0.349093 1.815087 0.107623 1.486047 3.292677 9.795768 38.064671 41.444649 0.349093 1.815087 107.622803 1.486047

Data Visualization and Exploratory Data Analysis

We will use pyplot, a MATLAB-like plotting framework from matplotlib. Matplotlib has thorough documentation and lots of sample code if you want to get started. Here is a link to a basic tutorial on using pyplot: https://matplotlib.org/2.0.2/users/pyplot_tutorial.html
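
The plots that follow all use the same pyplot pattern: one figure per pollutant, one line per state. Here is that pattern in miniature, with two invented state series:

```python
import matplotlib.pyplot as plt

# Two hypothetical state series (values invented for illustration).
years = [2000, 2001, 2002]
so2_by_state = {"AZ": [3.0, 2.5, 2.0], "WY": [0.4, 0.35, 0.3]}

fig, ax = plt.subplots()
for state, values in so2_by_state.items():
    ax.plot(years, values, label=state)   # one line per state
ax.set_xlabel("Year")
ax.set_ylabel("SO2 Mean")
ax.set_title("SO2 Mean vs Year")
ax.legend(loc="best")
```

On the real data, the inner loop becomes `data_avg.groupby(['state'])`, which yields one sub-frame per state.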

Pollutant Mean Concentration vs Year

We will first look at the change in mean concentration levels of the various pollutants.

In [8]:
for pollutant in ['SO2 Mean', 'CO Mean', 'NO2 Mean', 'O3 Mean']:
    plt.figure(figsize=(20, 15))
    for state, grp in data_avg.groupby(['state']):
        plt.plot(grp['year'], grp[pollutant], label=state)
    plt.xlabel("Year")
    plt.ylabel(pollutant)
    plt.title(pollutant + " vs Year")
    plt.legend(loc='best')
    plt.show()

Pollutant Concentration vs Year Analysis

From these plots we can see that SO2 and CO mean concentrations generally decrease over time. However, NO2 and O3 mean concentrations don't follow a linear trend and fluctuate over time.
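
One way to make "generally decrease" concrete is to fit a least-squares slope to each series with `np.polyfit` (degree 1 returns `[slope, intercept]`). A sketch on invented yearly national means; on the real data you would fit each state's series from data_avg:

```python
import numpy as np

# Invented yearly national means, for illustration only.
years = np.array([2000.0, 2004.0, 2008.0, 2012.0, 2016.0])
so2_mean = np.array([3.1, 2.6, 2.2, 1.5, 1.1])   # steadily falling
o3_mean = np.array([26.0, 27.5, 25.8, 26.9, 26.2])  # fluctuating

so2_slope = np.polyfit(years, so2_mean, 1)[0]
o3_slope = np.polyfit(years, o3_mean, 1)[0]
print(so2_slope, o3_slope)
```

A clearly negative slope supports a downward trend, while a slope near zero (as for the fluctuating O3 series) does not.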

Air Quality Index vs Year

We will next look at the change in air quality index (AQI) over time for each pollutant.

In [9]:
for pollutant in ['SO2 AQI', 'CO AQI', 'NO2 AQI', 'O3 AQI']:
    plt.figure(figsize=(20, 15))
    for state, grp in data_avg.groupby(['state']):
        plt.plot(grp['year'], grp[pollutant], label=state)
    plt.xlabel("Year")
    plt.ylabel(pollutant)
    plt.title(pollutant + " vs Year")
    plt.legend(loc='best')
    plt.show()

Pollutant Air Quality Index vs Year Analysis

From these plots we can see that SO2 AQI and CO AQI have a negative linear relationship with year. NO2 AQI and O3 AQI also generally decrease over time, but fluctuate much more than SO2 AQI and CO AQI.

Violin Plot

The previous plots are a bit complicated to read, but we can use other forms of data visualization to analyze the data more clearly. Another way to visualize this data is with a violin plot, which shows the distribution of data and its probability density; it is a combination of a box plot and a density plot. We will plot pollutant concentration over time and pollutant air quality index over time. We will use the ggplot package, a plotting system for Python based on R's ggplot2, which is designed for making informative plots quickly with minimal code. You can refer to the documentation and tutorials on how to use ggplot here: http://ggplot.yhathq.com/docs/index.html
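
If you prefer to stay in matplotlib, it also ships a built-in violin plot. A minimal sketch with random toy samples standing in for the per-state means of three years (all values here are generated, not from the dataset):

```python
import numpy as np
import matplotlib.pyplot as plt

# One array of samples per year; invented data for illustration.
rng = np.random.default_rng(0)
samples_by_year = [rng.normal(3.0 - 0.5 * i, 0.5, size=40) for i in range(3)]

fig, ax = plt.subplots()
parts = ax.violinplot(samples_by_year, positions=[2000, 2001, 2002])
ax.set_title("SO2 Mean over time (toy data)")
ax.set_xlabel("year")
ax.set_ylabel("SO2 Mean")
```

`violinplot` returns a dict whose "bodies" entry holds one polygon per violin, which is handy for styling.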

In [10]:
for col in ['SO2 Mean', 'CO Mean', 'NO2 Mean', 'O3 Mean',
            'SO2 AQI', 'CO AQI', 'NO2 AQI', 'O3 AQI']:
    fig = ggplot(aes(x='year', y=col), data=data_avg) +\
        geom_violin() +\
        labs(title=col + " over time",
             x = "year",
             y = col)
    fig.show()

Pollutant Concentration vs Year Analysis

From the CO and SO2 mean violin plots, we can see that the distribution in 2000 has a lot of variation, but over the years the variation decreases. However, the NO2 and O3 mean violin plots show high variation throughout 2000 - 2016 and stay fairly constant. The AQI vs. year violin plots show similar respective results to the pollutant mean violin plots.
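
The "variation decreases" reading can also be checked numerically: compare the standard deviation of the state averages within each year. A toy sketch with invented values in the shape of data_avg:

```python
import pandas as pd

# Toy stand-in for data_avg: three states in 2000 and 2016, values invented.
data_avg_toy = pd.DataFrame({
    "year": ["2000", "2000", "2000", "2016", "2016", "2016"],
    "state": ["AZ", "WY", "NY", "AZ", "WY", "NY"],
    "CO Mean": [2.1, 0.4, 1.5, 0.4, 0.2, 0.35],
})

# Standard deviation of state averages per year; a shrinking value
# matches the narrowing violins.
spread = data_avg_toy.groupby("year")["CO Mean"].std()
print(spread)
```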

Choropleth Map

Another way to visualize this data is with a choropleth map, a thematic map in which areas are shaded in proportion to the statistic being displayed. In this case, the map is of the US and each state is shaded in proportion to the average concentration of each pollutant; the higher the concentration, the darker the shade. Below are choropleth maps for the years 2000, 2008, and 2016. We will be using the plotly package, which requires an API key. You can register for an API key here: https://plot.ly/. Once you have an API key you can set your credentials in this line:

plotly.tools.set_credentials_file(username='YOUR_USERNAME', api_key='YOUR_API_KEY')

Plotly provides detailed tutorials/sample code and documentation. You can check them out here: https://help.plot.ly/tutorials/

Sulfur Dioxide Pollution Level by State for 2000, 2008, and 2016

In [11]:
import plotly
import plotly.plotly as py
from plotly.graph_objs import *
plotly.tools.set_credentials_file(username='YOUR_USERNAME', api_key='YOUR_API_KEY')  # replace with your own plotly credentials

# choropleth map of SO2 levels in the US in 2000

# Slice out the three years once; .copy() avoids pandas'
# SettingWithCopyWarning when the hover-text columns are added below.
data_2000 = data_avg[data_avg['year'] == '2000'].copy()
data_2008 = data_avg[data_avg['year'] == '2008'].copy()
data_2016 = data_avg[data_avg['year'] == '2016'].copy()

data_2000['text'] = data_2000['state'] + " has a SO2 Mean level of " + data_2000['SO2 Mean'].astype(str)
data_2008['text'] = data_2008['state'] + " has a SO2 Mean level of " + data_2008['SO2 Mean'].astype(str)
data_2016['text'] = data_2016['state'] + " has a SO2 Mean level of " + data_2016['SO2 Mean'].astype(str)

scl = [[0.0, 'rgb(242,240,247)'],[0.2, 'rgb(218,218,235)'],[0.4, 'rgb(188,189,220)'],\
            [0.6, 'rgb(158,154,200)'],[0.8, 'rgb(117,107,177)'],[1.0, 'rgb(84,39,143)']]

map_data = [ dict(
        type='choropleth',
        colorscale = scl,
        autocolorscale = False,
        locations = data_2000['state'],
        z = data_2000['SO2 Mean'].astype(float),
        locationmode = 'USA-states',
        text = data_2000['text'],
        marker = dict(
            line = dict (
                color = 'rgb(255,255,255)',
                width = 2
            )
        ),
        colorbar = dict(
            title = "Parts Per Billion"
        )
    ) ]

layout = dict(
        title = '2000 US Sulfur Dioxide Pollution Levels by State<br>(Hover for breakdown)',
        geo = dict(
            scope='usa',
            projection=dict( type='albers usa' ),
            showlakes = True,
            lakecolor = 'rgb(255, 255, 255)',
        ),
    )

fig = dict(data=map_data, layout=layout)

py.iplot(fig, filename='d3-cloropleth-map')

High five! You successfully sent some data to your account on plotly. View your plot in your browser at https://plot.ly/~vincentcheng08/0 or inside your plot.ly account where it is named 'd3-cloropleth-map'
Out[11]:
In [12]:
# choropleth map of SO2 levels in the US in 2008
map_data[0]["z"] = data_2008['SO2 Mean'].astype(float)
map_data[0]["locations"] = data_2008['state']
map_data[0]["text"] = data_2008['text']
layout["title"] = '2008 US Sulfur Dioxide Pollution Levels by State<br>(Hover for breakdown)'
fig = dict(data=map_data, layout=layout)
py.iplot(fig, filename='d3-cloropleth-map')
Out[12]:
In [13]:
# choropleth map of SO2 levels in the US in 2016
map_data[0]["z"] = data_2016['SO2 Mean'].astype(float)
map_data[0]["locations"] = data_2016['state']
map_data[0]["text"] = data_2016['text']
layout["title"] = '2016 US Sulfur Dioxide Pollution Levels by State<br>(Hover for breakdown)'
fig = dict(data=map_data, layout=layout)
py.iplot(fig, filename='d3-cloropleth-map')
Out[13]:

Sulfur Dioxide Levels by State Observation

In 2000, Virginia had the highest sulfur dioxide concentration. In 2008, New York had the highest level, and in 2016 Ohio did.
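Rather than reading these maxima off the shaded map, they can be pulled straight from the year slices with `idxmax`. A minimal sketch on a toy stand-in for one year's data (states and values invented):

```python
import pandas as pd

# Toy stand-in for data_2000 (state abbreviations, SO2 means; values invented).
toy_2000 = pd.DataFrame({
    'state': ['VA', 'NY', 'OH'],
    'SO2 Mean': [9.1, 7.4, 6.8],
})

# idxmax gives the row label of the maximum value;
# use it to look up the corresponding state.
top_state = toy_2000.loc[toy_2000['SO2 Mean'].idxmax(), 'state']
print(top_state)  # → VA
```

The same one-liner, applied to `data_2000`, `data_2008`, and `data_2016` with each pollutant column, reproduces the observations in this and the following sections.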

Carbon Monoxide Pollution Levels by State for 2000, 2008, and 2016

In [14]:
data_2000['text'] = data_2000['state'] + " has a CO Mean level of " + data_2000['CO Mean'].astype(str)
data_2008['text'] = data_2008['state'] + " has a CO Mean level of " + data_2008['CO Mean'].astype(str)
data_2016['text'] = data_2016['state'] + " has a CO Mean level of " + data_2016['CO Mean'].astype(str)

scl = [[0.0, 'rgb(242,240,247)'],[0.2, 'rgb(218,218,235)'],[0.4, 'rgb(188,189,220)'],\
            [0.6, 'rgb(158,154,200)'],[0.8, 'rgb(117,107,177)'],[1.0, 'rgb(84,39,143)']]

map_data = [ dict(
        type='choropleth',
        colorscale = scl,
        autocolorscale = False,
        locations = data_2000['state'],
        z = data_2000['CO Mean'].astype(float),
        locationmode = 'USA-states',
        text = data_2000['text'],
        marker = dict(
            line = dict (
                color = 'rgb(255,255,255)',
                width = 2
            )
        ),
        colorbar = dict(
            title = "Parts Per Billion"
        )
    ) ]

layout = dict(
        title = '2000 US Carbon Monoxide Pollution Levels by State<br>(Hover for breakdown)',
        geo = dict(
            scope='usa',
            projection=dict( type='albers usa' ),
            showlakes = True,
            lakecolor = 'rgb(255, 255, 255)',
        ),
    )

fig = dict(data=map_data, layout=layout)

py.iplot(fig, filename='d3-cloropleth-map')

Out[14]:
In [15]:
map_data[0]["z"] = data_2008['CO Mean'].astype(float)
map_data[0]["locations"] = data_2008['state']
map_data[0]["text"] = data_2008['text']
layout["title"] = '2008 US Carbon Monoxide Pollution Levels by State<br>(Hover for breakdown)'
fig = dict(data=map_data, layout=layout)
py.iplot(fig, filename='d3-cloropleth-map')
Out[15]:
In [16]:
map_data[0]["z"] = data_2016['CO Mean'].astype(float)
map_data[0]["locations"] = data_2016['state']
map_data[0]["text"] = data_2016['text']
layout["title"] = '2016 US Carbon Monoxide Pollution Levels by State<br>(Hover for breakdown)'
fig = dict(data=map_data, layout=layout)
py.iplot(fig, filename='d3-cloropleth-map')
Out[16]:

Carbon Monoxide Levels by State Observation

In 2000, Indiana had the highest carbon monoxide concentration. In 2008, Arkansas had the highest level, and in 2016 Florida did.

Ozone Pollution Levels by State from 2000, 2008, and 2016

In [17]:
data_2000['text'] = data_2000['state'] + " has a O3 Mean level of " + data_2000['O3 Mean'].astype(str)
data_2008['text'] = data_2008['state'] + " has a O3 Mean level of " + data_2008['O3 Mean'].astype(str)
data_2016['text'] = data_2016['state'] + " has a O3 Mean level of " + data_2016['O3 Mean'].astype(str)

scl = [[0.0, 'rgb(242,240,247)'],[0.2, 'rgb(218,218,235)'],[0.4, 'rgb(188,189,220)'],\
            [0.6, 'rgb(158,154,200)'],[0.8, 'rgb(117,107,177)'],[1.0, 'rgb(84,39,143)']]

map_data = [ dict(
        type='choropleth',
        colorscale = scl,
        autocolorscale = False,
        locations = data_2000['state'],
        z = data_2000['O3 Mean'].astype(float),
        locationmode = 'USA-states',
        text = data_2000['text'],
        marker = dict(
            line = dict (
                color = 'rgb(255,255,255)',
                width = 2
            )
        ),
        colorbar = dict(
            title = "Parts Per Billion"
        )
    ) ]

layout = dict(
        title = '2000 US Ozone Pollution Levels by State<br>(Hover for breakdown)',
        geo = dict(
            scope='usa',
            projection=dict( type='albers usa' ),
            showlakes = True,
            lakecolor = 'rgb(255, 255, 255)',
        ),
    )

fig = dict(data=map_data, layout=layout)

py.iplot(fig, filename='d3-cloropleth-map')

Out[17]:
In [18]:
map_data[0]["z"] = data_2008['O3 Mean'].astype(float)
map_data[0]["locations"] = data_2008['state']
map_data[0]["text"] = data_2008['text']
layout["title"] = '2008 US Ozone Pollution Levels by State<br>(Hover for breakdown)'
fig = dict(data=map_data, layout=layout)
py.iplot(fig, filename='d3-cloropleth-map')
Out[18]:
In [19]:
map_data[0]["z"] = data_2016['O3 Mean'].astype(float)
map_data[0]["locations"] = data_2016['state']
map_data[0]["text"] = data_2016['text']
layout["title"] = '2016 US Ozone Pollution Levels by State<br>(Hover for breakdown)'
fig = dict(data=map_data, layout=layout)
py.iplot(fig, filename='d3-cloropleth-map')
Out[19]:

Ozone Levels by State Observation

In 2000, North Carolina had the highest ozone concentration. In both 2008 and 2016, Wyoming had the highest level.

Nitrogen Dioxide Pollution Levels by State from 2000, 2008, and 2016

In [20]:
data_2000['text'] = data_2000['state'] + " has a NO2 Mean level of " + data_2000['NO2 Mean'].astype(str)
data_2008['text'] = data_2008['state'] + " has a NO2 Mean level of " + data_2008['NO2 Mean'].astype(str)
data_2016['text'] = data_2016['state'] + " has a NO2 Mean level of " + data_2016['NO2 Mean'].astype(str)

scl = [[0.0, 'rgb(242,240,247)'],[0.2, 'rgb(218,218,235)'],[0.4, 'rgb(188,189,220)'],\
            [0.6, 'rgb(158,154,200)'],[0.8, 'rgb(117,107,177)'],[1.0, 'rgb(84,39,143)']]

map_data = [ dict(
        type='choropleth',
        colorscale = scl,
        autocolorscale = False,
        locations = data_2000['state'],
        z = data_2000['NO2 Mean'].astype(float),
        locationmode = 'USA-states',
        text = data_2000['text'],
        marker = dict(
            line = dict (
                color = 'rgb(255,255,255)',
                width = 2
            )
        ),
        colorbar = dict(
            title = "Parts Per Billion"
        )
    ) ]

layout = dict(
        title = '2000 US Nitrogen Dioxide Pollution Levels by State<br>(Hover for breakdown)',
        geo = dict(
            scope='usa',
            projection=dict( type='albers usa' ),
            showlakes = True,
            lakecolor = 'rgb(255, 255, 255)',
        ),
    )

fig = dict(data=map_data, layout=layout)

py.iplot(fig, filename='d3-cloropleth-map')

Out[20]:
In [21]:
map_data[0]["z"] = data_2008['NO2 Mean'].astype(float)
map_data[0]["locations"] = data_2008['state']
map_data[0]["text"] = data_2008['text']
layout["title"] = '2008 US Nitrogen Dioxide Pollution Levels by State<br>(Hover for breakdown)'
fig = dict(data=map_data, layout=layout)
py.iplot(fig, filename='d3-cloropleth-map')
Out[21]:
In [22]:
map_data[0]["z"] = data_2016['NO2 Mean'].astype(float)
map_data[0]["locations"] = data_2016['state']
map_data[0]["text"] = data_2016['text']
layout["title"] = '2016 US Nitrogen Dioxide Pollution Levels by State<br>(Hover for breakdown)'
fig = dict(data=map_data, layout=layout)
py.iplot(fig, filename='d3-cloropleth-map')
Out[22]:

Nitrogen Dioxide Levels by State Observation

In 2000, Arizona had the highest nitrogen dioxide concentration. In 2008, Massachusetts had the highest level, and in 2016 Utah did.

Hypothesis testing and Machine Learning

Here we check whether there is a linear relationship between each gas's concentration and its AQI by running a linear regression for each gas. All of the gases show some semblance of a linear relationship. We follow http://scikit-learn.org/stable/modules/linear_model.html#ordinary-least-squares and http://www.statsmodels.org/stable/index.html to generate a linear regression plot and a summary of the statistics. CO has the highest correlation, with an R^2 of .88, followed by NO2 with .82, SO2 with .69, and O3 with .59. The coefficient from COMean to COAQI is 0.0173, indicating a positive linear relationship between them. NO2 has a larger effect on AQI, with a coefficient of 1.4443. SO2 has a coefficient of 3.5809, the highest of the four, and O3 has a coefficient of 1.3374. However, since the O3 model has a low R^2 of .59, we hesitate to claim a linear correlation between the O3 mean and O3 AQI.
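The R^2 and coefficient values quoted above can also be read off the fitted statsmodels result object programmatically, instead of from the printed summary. A self-contained sketch on simulated data (the 0.0173 slope is borrowed from the CO fit below; the data itself is invented, not the tutorial's `data` frame):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated stand-in for the CO columns: AQI is a noisy linear
# function of the mean, with a slope mimicking the real fit.
rng = np.random.default_rng(0)
x = rng.uniform(0, 50, 500)
toy = pd.DataFrame({'COMean': x,
                    'COAQI': 0.0173 * x + rng.normal(0, 0.05, 500)})

# Fit the same formula-style OLS the tutorial uses, then read the
# statistics off the result object rather than the printed table.
fit = smf.ols('COAQI ~ COMean', data=toy).fit()
print(round(fit.rsquared, 3), round(fit.params['COMean'], 4))
```

Accessing `fit.rsquared` and `fit.params` this way makes it easy to tabulate all four gases in a loop instead of eyeballing four summaries.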

In [23]:
from sklearn import linear_model
import statsmodels.formula.api as smf
import statsmodels.api as sm
In [24]:
#NO2 Lin reg model
# Get a lin reg model
regr = linear_model.LinearRegression()
# Fit the model to the data
regr.fit(data['NO2Mean'].values.reshape(-1, 1), data['NO2AQI'])
# Predict NO2 AQI values with the fitted model
Predicted_NO2_AQI = regr.predict(data['NO2Mean'].values.reshape(-1, 1))
# Plot data and the lin reg
plt.scatter(data['NO2Mean'], data['NO2AQI'],  color='lightgreen')
plt.plot(data['NO2Mean'], Predicted_NO2_AQI, color='darkblue', linewidth=1)
plt.show()
regression = smf.ols(formula='NO2AQI ~ NO2Mean', data=data).fit()
print(regression.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 NO2AQI   R-squared:                       0.820
Model:                            OLS   Adj. R-squared:                  0.820
Method:                 Least Squares   F-statistic:                 1.987e+06
Date:                Fri, 15 Dec 2017   Prob (F-statistic):               0.00
Time:                        18:49:54   Log-Likelihood:            -1.4334e+06
No. Observations:              436876   AIC:                         2.867e+06
Df Residuals:                  436874   BIC:                         2.867e+06
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      5.3793      0.016    328.927      0.000       5.347       5.411
NO2Mean        1.4443      0.001   1409.664      0.000       1.442       1.446
==============================================================================
Omnibus:                   112531.790   Durbin-Watson:                   1.179
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           540799.275
Skew:                           1.174   Prob(JB):                         0.00
Kurtosis:                       7.919   Cond. No.                         26.9
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [25]:
#O3 Lin reg model
from sklearn import linear_model
# Get a lin reg model
regr = linear_model.LinearRegression()
# Fit the model to the data
regr.fit(data['O3Mean'].values.reshape(-1, 1), data['O3AQI'])
# Predict O3 AQI values with the fitted model
Predicted_O3_AQI = regr.predict(data['O3Mean'].values.reshape(-1, 1))
# Plot data and the lin reg
plt.scatter(data['O3Mean'], data['O3AQI'],  color='lightgreen')
plt.plot(data['O3Mean'], Predicted_O3_AQI, color='darkblue', linewidth=1)
plt.show()
regression = smf.ols(formula='O3AQI ~ O3Mean', data=data).fit()
print(regression.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  O3AQI   R-squared:                       0.591
Model:                            OLS   Adj. R-squared:                  0.591
Method:                 Least Squares   F-statistic:                 6.312e+05
Date:                Fri, 15 Dec 2017   Prob (F-statistic):               0.00
Time:                        18:49:57   Log-Likelihood:            -1.7285e+06
No. Observations:              436876   AIC:                         3.457e+06
Df Residuals:                  436874   BIC:                         3.457e+06
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      1.1116      0.048     23.177      0.000       1.018       1.206
O3Mean         1.3374      0.002    794.453      0.000       1.334       1.341
==============================================================================
Omnibus:                   275985.147   Durbin-Watson:                   0.705
Prob(Omnibus):                  0.000   Jarque-Bera (JB):          4323818.396
Skew:                           2.811   Prob(JB):                         0.00
Kurtosis:                      17.350   Cond. No.                         71.5
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [26]:
#SO2 Lin reg model
from sklearn import linear_model
# Get a lin reg model
regr = linear_model.LinearRegression()
# Fit the model to the data
regr.fit(data['SO2Mean'].values.reshape(-1, 1), data['SO2AQI'])
# Predict SO2 AQI values with the fitted model
Predicted_SO2_AQI = regr.predict(data['SO2Mean'].values.reshape(-1, 1))
# Plot data and the lin reg
plt.scatter(data['SO2Mean'], data['SO2AQI'],  color='lightgreen')
plt.plot(data['SO2Mean'], Predicted_SO2_AQI, color='darkblue', linewidth=1)
plt.show()
# print coef, intercept, R^2
regression = smf.ols(formula='SO2AQI ~ SO2Mean', data=data).fit()
print(regression.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 SO2AQI   R-squared:                       0.686
Model:                            OLS   Adj. R-squared:                  0.686
Method:                 Least Squares   F-statistic:                 9.552e+05
Date:                Fri, 15 Dec 2017   Prob (F-statistic):               0.00
Time:                        18:50:01   Log-Likelihood:            -1.4501e+06
No. Observations:              436876   AIC:                         2.900e+06
Df Residuals:                  436874   BIC:                         2.900e+06
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.3793      0.012     30.978      0.000       0.355       0.403
SO2Mean        3.5809      0.004    977.347      0.000       3.574       3.588
==============================================================================
Omnibus:                   435754.926   Durbin-Watson:                   1.241
Prob(Omnibus):                  0.000   Jarque-Bera (JB):      17102860422.984
Skew:                          -3.145   Prob(JB):                         0.00
Kurtosis:                     972.286   Cond. No.                         4.17
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [27]:
#CO Lin reg model
from sklearn import linear_model
# Get a lin reg model
regr = linear_model.LinearRegression()
# Fit the model to the data
regr.fit(data['COMean'].values.reshape(-1, 1), data['COAQI'])
# Predict CO AQI values with the fitted model
Predicted_CO_AQI = regr.predict(data['COMean'].values.reshape(-1, 1))
# Plot data and the lin reg
plt.scatter(data['COMean'], data['COAQI'],  color='lightgreen')
plt.plot(data['COMean'], Predicted_CO_AQI, color='darkblue', linewidth=1)
plt.show()
# print coef, intercept, R^2
regression = smf.ols(formula='COAQI ~ COMean', data=data).fit()
print(regression.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  COAQI   R-squared:                       0.878
Model:                            OLS   Adj. R-squared:                  0.878
Method:                 Least Squares   F-statistic:                 3.138e+06
Date:                Fri, 15 Dec 2017   Prob (F-statistic):               0.00
Time:                        18:50:05   Log-Likelihood:            -9.3260e+05
No. Observations:              436876   AIC:                         1.865e+06
Df Residuals:                  436874   BIC:                         1.865e+06
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -0.4108      0.005    -86.283      0.000      -0.420      -0.401
COMean         0.0173   9.78e-06   1771.308      0.000       0.017       0.017
==============================================================================
Omnibus:                   421800.739   Durbin-Watson:                   0.944
Prob(Omnibus):                  0.000   Jarque-Bera (JB):         85701724.344
Skew:                           4.195   Prob(JB):                         0.00
Kurtosis:                      71.100   Cond. No.                         748.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
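Beyond R^2, a quick sanity check on fits like these is the average size of the prediction errors. A self-contained sketch using simulated data in place of the tutorial's `data` frame (the 0.0173 slope mimics the CO fit; the noise level is invented):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(42)
x = rng.uniform(0, 3000, 400)              # simulated CO mean values
y = 0.0173 * x + rng.normal(0, 1.0, 400)   # simulated AQI with noise

# Fit and measure the mean absolute error of the in-sample predictions.
regr = LinearRegression().fit(x.reshape(-1, 1), y)
mae = mean_absolute_error(y, regr.predict(x.reshape(-1, 1)))
print(round(mae, 2))
```

A small error relative to the AQI scale, together with a high R^2, is stronger evidence of a usable linear fit than either number alone.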

Conclusion

In conclusion, the data suggests that America reduced its pollution output between 2000 and 2016, most visibly for CO and SO2 concentrations. This means the air that Americans breathe has become healthier over this period.